PASTA: Ultra-Large Multiple Sequence Alignment
Abstract
In this paper, we introduce a new and highly scalable algorithm, PASTA, for large-scale multiple sequence alignment estimation. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy of the leading alignment methods on large datasets, and is able to analyze much larger datasets than the current methods. We also show that trees estimated on PASTA alignments are highly accurate – slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is very fast, highly parallelizable, and requires relatively little memory.
Related resources
Supplementary Online Material PASTA: ultra-large multiple sequence alignment
We introduce PASTA, a new method for multiple sequence alignment of datasets with up to 200,000 sequences in [3]. Here we provide supplementary information not provided in the main paper. We give exact commands used for running the experiments, we provide extra results that did not fit in the main paper, and we provide some supplementary discussion of the results.
PASTASpark: multiple sequence alignment meets Big Data
Motivation: One basic step in many bioinformatics analyses is multiple sequence alignment. One of the state-of-the-art tools for multiple sequence alignment is PASTA (Practical Alignments using SATé and TrAnsitivity). PASTA supports multithreading, but it is limited to processing datasets on shared-memory systems. In this work we introduce PASTASpark, a tool that uses the Big Data engine ...
An Application of the ABS LX Algorithm to Multiple Sequence Alignment
We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...
Ultra-Conserved Elements in Vertebrate and Fly Genomes
Our analyses of ultra-conserved elements are based on multiple sequence alignments produced by MAVID [Bray and Pachter, 2004]. Prior to the alignment of multiple genomes, homology mappings (from Mercator [Dewey, 2005]) group into bins genomic regions that are anchored together by neighboring homologous exons. A multiple sequence alignment is then produced for each of these alignment bins. MAVID...
Large-Scale Multiple Sequence Alignment and Phylogeny Estimation
With the advent of next generation sequencing technologies, alignment and phylogeny estimation of datasets with thousands of sequences is being attempted. To address these challenges, new algorithmic approaches have been developed that have been able to provide substantial improvements over standard methods. This paper focuses on new approaches for ultra-large tree estimation, including methods...